81 research outputs found

    GPU-Accelerated Large-Eddy Simulation of Turbulent Channel Flows

    High performance computing clusters augmented with cost- and power-efficient graphics processing units (GPUs) provide new opportunities to broaden the use of the large-eddy simulation technique to study high-Reynolds-number turbulent flows in fluids engineering applications. In this paper, we extend our earlier work on multi-GPU acceleration of an incompressible Navier-Stokes solver to include a large-eddy simulation (LES) capability. In particular, we implement the Lagrangian dynamic subgrid-scale model and compare our results against existing direct numerical simulation (DNS) data of a turbulent channel flow at Reτ = 180. Overall, our LES results match the DNS data fairly well. Our results show that the Reτ = 180 case can be simulated entirely on a single GPU, whereas higher-Reynolds-number cases can benefit from a GPU cluster.
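    The Lagrangian dynamic subgrid-scale model used in the paper carries per-point history and is beyond a short sketch; as a point of reference, the static Smagorinsky model it generalizes computes an eddy viscosity from the resolved strain rate. A minimal illustration on a uniform periodic grid (the function and variable names are hypothetical, not from the paper's solver):

```python
import numpy as np

def smagorinsky_nu_t(u, v, w, dx, Cs=0.1):
    """Static Smagorinsky eddy viscosity nu_t = (Cs * Delta)^2 * |S|,
    where |S| = sqrt(2 S_ij S_ij) is the resolved strain-rate magnitude.
    Uniform grid spacing dx; periodic central differences for simplicity."""
    def d(f, axis):  # central difference with periodic wrap-around
        return (np.roll(f, -1, axis) - np.roll(f, 1, axis)) / (2 * dx)
    # strain-rate tensor components S_ij = 0.5 * (du_i/dx_j + du_j/dx_i)
    S11, S22, S33 = d(u, 0), d(v, 1), d(w, 2)
    S12 = 0.5 * (d(u, 1) + d(v, 0))
    S13 = 0.5 * (d(u, 2) + d(w, 0))
    S23 = 0.5 * (d(v, 2) + d(w, 1))
    S_mag = np.sqrt(2 * (S11**2 + S22**2 + S33**2
                         + 2 * (S12**2 + S13**2 + S23**2)))
    return (Cs * dx) ** 2 * S_mag
```

    The dynamic procedure replaces the fixed constant Cs with one computed on the fly from a test filter, and the Lagrangian variant averages that computation along fluid pathlines.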

    Active memory controller

    Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
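    The message-traffic argument behind AMOs can be illustrated with a toy cost model: a shared counter updated by many processors forces the cache line to migrate on every conventional update, whereas an AMO sends the operation to the data instead. The message counts below are illustrative assumptions for a generic directory protocol, not figures from the paper:

```python
# Toy cost model contrasting an Active Memory Operation (AMO) with a
# conventional cache-coherent read-modify-write on a shared counter.
# The per-event message counts are illustrative assumptions.

def rmw_messages(n_updaters):
    """Conventional update: each updater fetches the line exclusively
    (read-exclusive request + data reply), and after the first update
    the directory must also invalidate the previous owner's copy."""
    msgs = 0
    for i in range(n_updaters):
        msgs += 2            # read-exclusive request + data reply
        if i > 0:
            msgs += 1        # invalidate the previous owner's copy
    return msgs

def amo_messages(n_updaters):
    """AMO-style update: each updater sends one operation to the home
    memory controller, which applies it in place; one ack returns."""
    return 2 * n_updaters    # operation request + ack per updater
```

    Under this model, 16 updaters cost 47 messages conventionally versus 32 with AMOs, and the line never ping-pongs between caches; that traffic reduction is what the paper's faster barriers and spinlocks exploit.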

    Parallel Solution of Recurrence Problems

    An mth-order recurrence problem is defined as the computation of the sequence x_1, ..., x_N, where x_i = f(a_i, x_{i-1}, ..., x_{i-m}) and a_i is some vector of parameters. This paper investigates general algorithms for solving such problems on highly parallel computers. We show that if the recurrence function f has associated with it two other functions that satisfy certain composition properties, then we can construct elegant and efficient parallel algorithms that can compute all N elements of the series in time proportional to ⌈log₂ N⌉. The class of problems having this property includes linear recurrences of all orders, both homogeneous and inhomogeneous, recurrences involving matrix or binary quantities, and various nonlinear problems involving operations such as computation with matrix inverses, exponentiation, and modulo division.
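    The composition property can be made concrete for the simplest case, a first-order linear recurrence x_i = a_i * x_{i-1} + b_i: each step is an affine map, affine maps compose associatively, so a parallel scan over the maps yields all N values in ⌈log₂ N⌉ rounds. A sketch (sequential code that mimics the parallel round structure):

```python
def solve_linear_recurrence(a, b, x0):
    """Solve x_i = a_i * x_{i-1} + b_i for i = 1..N in ceil(log2 N)
    composition rounds. Each affine map x -> a*x + b is a pair (a, b);
    the composition (a2,b2) after (a1,b1) is (a2*a1, a2*b1 + b2), and
    its associativity is the property the paper exploits."""
    n = len(a)
    maps = list(zip(a, b))           # maps[i] takes x_{i-1} to x_i
    k = 1
    while k < n:
        # on a parallel machine, all positions update at once per round
        new = maps[:]
        for i in range(k, n):
            a2, b2 = maps[i]         # later map
            a1, b1 = maps[i - k]     # earlier (prefix) map
            new[i] = (a2 * a1, a2 * b1 + b2)   # compose the prefixes
        maps = new
        k *= 2
    # after ceil(log2 n) rounds, maps[i] is the full map from x0 to x_i
    return [ai * x0 + bi for ai, bi in maps]
```

    For example, a = [2, 3, 1, 0.5], b = [1, 0, 2, 1], x0 = 1 gives [3, 9, 11, 6.5], matching a sequential evaluation, but using only 2 rounds instead of 4 dependent steps.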

    The energy complexity of register files

    Register files represent a substantial portion of the energy budget in modern processors, and are growing rapidly with the trend towards larger Instruction Level Parallelism (ILP). The energy cost of a register file access depends greatly on the register file circuitry used. This paper compares various register file circuitry techniques for their energy efficiency, as a function of architectural parameters such as the number of registers and the number of ports. The Port Priority Selection technique combined with differential reads and low-swing writes was found to be the most energy-efficient, and provided significant energy savings compared to traditional approaches in the case of large register files. The dependence of register file access energy upon technology scaling is also studied. However, as this paper shows, it appears that none of these techniques will be enough to prevent centralized register files from becoming the dominant power component of next-generation superscalar computers, and alternative methods for inter-instruction communication need to be developed.

    Should we worry about memory loss?

    In recent years the High Performance Computing (HPC) industry has benefited from the development of higher density multi-core processors. With recent chips capable of executing up to 32 tasks in parallel, this rate of growth also shows no sign of slowing. Alongside the development of denser micro-processors has been the considerably more modest rate of improvement in random access memory (RAM) capacities. The effect has been that the available memory-per-core has reduced, and current projections suggest that this is set to reduce further. In this paper we present three studies into the use and measurement of memory in parallel applications; our aim is to capture, understand and, if possible, reduce the memory-per-core needed by complete, multi-component applications. First, we present benchmarked memory usage and runtimes of six scientific benchmarks, which represent algorithms that are common to a host of production-grade codes. Memory usage of each benchmark is measured and reported for a variety of compiler toolkits, and we show greater than 30% variation in memory high-water-mark requirements between compilers. Second, we combine this benchmark data with runtime data to simulate, via the Maui scheduler simulator, the effect on a multi-science workflow if memory-per-core is reduced from 1.5 GB per core to only 256 MB. Finally, we present initial results from a new memory profiling tool currently in development at the University of Warwick. This tool is applied to a finite-element benchmark and is able to map high-water-mark memory allocations to individual program functions. This demonstrates a lightweight and accurate method of identifying potential memory problems, a technique we expect to become commonplace as memory capacities decrease.
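    The Warwick tool itself is not shown here, but the idea of attributing high-water-mark allocations to program locations can be approximated with Python's standard tracemalloc module; the workload below is a hypothetical stand-in for a real benchmark:

```python
import tracemalloc

def top_allocators(workload, limit=3):
    """Run `workload`, then report the traced peak and which source
    lines held the most live memory -- a rough analogue of mapping
    high-water-mark allocations to individual program functions."""
    tracemalloc.start()
    result = workload()                  # keep the result live for the snapshot
    snapshot = tracemalloc.take_snapshot()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    stats = snapshot.statistics("lineno")[:limit]  # sorted by size, descending
    return peak, stats

def example_workload():
    # hypothetical stand-in for a finite-element assembly loop
    big = [0.0] * 1_000_000              # dominant allocation (~8 MB of pointers)
    return big
```

    Calling `top_allocators(example_workload)` points straight at the `big = ...` line as the dominant allocation site; a production tool would do the same attribution with far lower overhead and without needing an interpreter.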

    Optimization of high-performance superscalar architectures for energy efficiency

    In recent years reducing power has become a critical design goal for high-performance microprocessors. This work attempts to bring the power issue to the earliest phase of high-performance microprocessor development. We propose a methodology for power optimization at the micro-architectural level. First, major targets for power reduction are identified within the superscalar microarchitecture; then an optimization of a superscalar micro-architecture is performed that generates a set of energy-efficient configurations forming a convex hull in the power-performance space. The energy-efficient families are then compared to find configurations that dissipate the lowest power given a performance target, or, conversely, deliver the highest performance given a power budget. Application of the developed methodology to a superscalar micro-architecture shows that at the architectural level there is a potential for reducing power up to 50%, given a performance requirement, and for up to 15% performance improvement, given a power budget.
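    The selection step described above can be sketched as a Pareto-dominance filter over (performance, power) points, a discrete stand-in for the convex hull construction in the power-performance space; the configurations and their numbers below are hypothetical:

```python
def energy_efficient_frontier(configs):
    """Keep configurations not dominated in the power-performance space:
    a config is dropped if some other config delivers at least its
    performance for no more power, and is strictly better in one of the
    two. `configs` is a list of (name, performance, power) tuples."""
    frontier = []
    for name, perf, power in configs:
        dominated = any(
            p2 >= perf and w2 <= power and (p2 > perf or w2 < power)
            for _, p2, w2 in configs
        )
        if not dominated:
            frontier.append((name, perf, power))
    # sort by performance so "lowest power at a performance target" and
    # "highest performance within a power budget" are simple lookups
    return sorted(frontier, key=lambda c: c[1])

# hypothetical design points: (name, relative performance, watts)
designs = [("A", 1.0, 10), ("B", 1.2, 12), ("C", 1.1, 15), ("D", 0.9, 11)]
```

    Here "C" is dominated by "B" (faster for less power) and "D" by "A", so only A and B survive; answering either optimization question then means scanning a short sorted frontier instead of the full design space.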